1. Introduction


1.1 Overview of TCAM

Content addressable memory (CAM) is one of the most promising hardware solutions for
high-speed data searching and has many practical applications, such as anti-virus scanners,
internet protocol (IP) filters, and network switches [1]. Because a CAM stores its content (or
data) in internal memory elements and compares all of it with the search data in parallel, it
is much faster than a software lookup. There are two types of CAM: binary CAM and ternary
CAM (TCAM). In addition to the two binary states (‘0’ and ‘1’), a TCAM supports a “don’t
care” state that acts as a wildcard and matches either search value. The most important
requirement for a TCAM cell is a fast search operation. For this reason, static random access
memory (SRAM) has been widely used as the memory element of the conventional TCAM cell,
even though it has a high bit-cell cost, typically requiring 12-16 transistors per cell, as
shown in Figure 1 [2], [3].




Figure  1:  Conventional  SRAM  based  TCAM  Cell  Architecture  Consisting  of  Two 
Volatile  Storage  Elements  and  Comparison  Logic  [3] 

However,  recent  trends  in  electronic  applications,  such  as  internet  of  things,  big  data,  wireless 
sensors,  and  mobile  devices,  have  begun  to  focus  on  the  importance  of  energy  consumption. 
The  large  SRAM-based  TCAM  cell  inevitably  increases  capacitive  loading  of  match  lines 
(MLs) and search lines (SLs), which in turn raises the dynamic power of the search operation.
Also, as complementary metal-oxide-semiconductor (CMOS) technology shrinks to the nanometer
scale, another major issue has emerged: high standby power due to leakage current. A
scaled-down channel length increases the leakage current, and hence the use of SRAM in TCAM
applications is not a sustainable pathway.

1.2  Magneto-electric  Random  Access  Memory  (MeRAM)  for  TCAM  Application 
1.2.1  MeRAM  Structure 

The first approach to achieving a low-power, high-density TCAM with a comparable search
speed is to use emerging memory technologies. Although emerging nonvolatile memories, such
as resistive RAM (ReRAM), phase-change RAM (PCRAM), and spin-transfer torque RAM (STT-RAM),
have been proposed for TCAM applications [3]-[5], we propose to use MeRAM as the memory
element of the TCAM because MeRAM outperforms these memory technologies in terms of speed,
energy, and density. Typically, a MeRAM cell consists of one transistor and one
voltage-controlled magnetic tunnel junction (1T-1MTJ), as shown in Figure 2, where the bottom
layer of the voltage-controlled magnetic tunnel junction (VC-MTJ) is connected to the drain
of the access transistor and the top layer is connected to the bit line (BL). The access
transistor in MeRAM can be made smaller because voltage-driven switching ideally requires no
current flow. Thus, the MeRAM bit-cell array can achieve higher density than other families
of magneto-resistive RAM (MRAM). Also, the tunnel barrier is relatively thick, which reduces
ohmic dissipation during the write operation.


Figure  2:  1T-1MTJ  Cell  Architecture  in  MeRAM 

Switching can be achieved by applying an electric pulse to the MTJ, which induces a
magnetic precessional motion in the free layer. Since this type of switching mechanism is
non-deterministic, a unipolar pulse can switch either from AP to P or from P to AP.

1.2.2  Voltage-Controlled  Magnetic  Tunnel  Junction 

Magnetic tunnel junctions (MTJs) are being actively developed by the semiconductor
industry as one of the most promising memory devices, opening the door to next-generation
low-power, high-speed system architectures. MTJs offer compatibility with CMOS fabrication,
a sufficient on/off ratio (i.e. tunneling magneto-resistance (TMR)), and scalability. These
characteristics allow MTJs to be integrated with state-of-the-art charge-based electronics
on the same chip as a non-volatile storage block, leading to more compact, faster, and more
efficient electronic systems. An MTJ consists of two ferromagnetic layers (e.g. CoFeB)
separated by a tunneling barrier (e.g. MgO). For normal memory operation, the magnetization
of the pinned (or fixed) layer must not change under any bias condition. Typically, two
stable magnetic equilibria exist along an easy magnetic axis in the free layer. The parallel
state (P) occurs when the magnetic moments of both layers are aligned in the same direction,
giving rise to a low resistance (R_P); the anti-parallel state (AP) occurs when the magnetic
moment of the free layer points in the opposite direction to that of the pinned layer, giving
rise to a high resistance (R_AP). Note that the MTJ can also take an intermediate resistance
value between R_P and R_AP while the magnetization of the free layer is in a transient
position.

In a perpendicularly magnetized MTJ, the magnetization rotates between the +z and -z
directions by passing through the in-plane (hard-axis) direction, overcoming the energy
barrier as shown in Figure 3. One of the main advantages of the perpendicular MTJ is that its
energy barrier is dominated not by the device geometry but by the perpendicular anisotropy.
Since the perpendicular anisotropy can be readily enhanced by engineering the interface, the
perpendicular MTJ scales better than the in-plane device, whose energy barrier is governed
by the aspect ratio. Recently, a large interfacial perpendicular anisotropy with a TMR ratio
larger than 100% has been demonstrated in an Fe-rich CoFeB-based MTJ [6]. More importantly,
it has been experimentally observed that this perpendicular anisotropy can be modulated by
applying an electric field across the MTJ device, enabling a low-energy, high-speed
voltage-controlled switching scheme.



Figure 3: Perpendicular MTJ Utilizing the Interfacial Perpendicular Anisotropy for
Creating Two Equilibrium States by Overcoming the Demagnetization Field in the z Directions

Since the shape anisotropy is not necessary for forming the energy barrier, the geometry of
the device can be circular, leading to better scalability.

1.3  MeTCAM  Cell  Structure  and  Operation 

To design a TCAM cell with low standby power, high-speed access, and low bit-cell cost,
we propose a MeRAM-based TCAM, referred to as MeTCAM. A MeTCAM cell consists of five
transistors and two VC-MTJs, i.e. a 5T-2MTJ structure, as shown in Figure 4(a). M1 and M2
are comparison transistors whose gates are connected to the search lines (SL and SLB) and
whose drains are connected to the pinned layers of the MTJs when M1 and M2 are p-type
metal-oxide-semiconductor (PMOS) transistors. M3 is an access transistor, which is required
for the configuration (write) operation and is shared by the two storage elements (MTJs), b1
and b2, to reduce the cell area. BFR is a digital inverter, which transfers a spike to the
OUT node depending on the potential of the center (CE) node during the search operation. The
truth table of the MeTCAM cell for the search operation is shown in Table 1.

Figure 4: (a) Schematic and Search Operation of a MeTCAM Cell and (b) Circuit
Simulation of Search Operation

When the cell receives a spike on the enable (EN) node, the potential at CE rises. If the
stored data matches the search data, the potential at CE will cross a threshold, which in
turn transfers a spike to the post-cell via the OUT node. On the other hand, in the
mismatched case, the potential at CE will be lower than the threshold and there is no
spiking at the OUT node.

Table 1. Truth Table of MeTCAM Cell including Don’t Care Condition
(In the case where M1 and M2 are PMOS)

Stored Data      (b1, b2)   (SL, SLB)   OUT
0                (0,1)      (1,0)       No Spike (Miss)
0                (0,1)      (0,1)       Spike (Match)
1                (1,0)      (1,0)       Spike (Match)
1                (1,0)      (0,1)       No Spike (Miss)
Don't care (X)   (0,0)      (1,0)       Spike (Match)
Don't care (X)   (0,0)      (0,1)       Spike (Match)
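
The rows of Table 1 can be reproduced with a small behavioral sketch. It assumes the PMOS polarity stated with the table (a comparison transistor conducts when its search line is 0) and the stored-bit encoding 0 = low resistance, 1 = high resistance. The function name `metcam_cell_out` is hypothetical and is not part of the proposed design.

```python
LOW_RESISTANCE = 0  # stored-bit encoding assumed from Table 1: 0 = R_P, 1 = R_AP


def metcam_cell_out(b1: int, b2: int, sl: int, slb: int) -> bool:
    """Return True when the cell emits a spike (match), following Table 1."""
    # A PMOS comparison transistor conducts when its gate is low,
    # so SL = 0 selects MTJ b1 and SLB = 0 selects MTJ b2.
    selected = []
    if sl == 0:
        selected.append(b1)
    if slb == 0:
        selected.append(b2)
    # The CE node crosses the threshold only through a low-resistance MTJ.
    return any(bit == LOW_RESISTANCE for bit in selected)


# Rows of Table 1 as (b1, b2, SL, SLB).
print(metcam_cell_out(0, 1, 1, 0))  # stored 0, miss
print(metcam_cell_out(0, 1, 0, 1))  # stored 0, match
print(metcam_cell_out(0, 0, 1, 0))  # don't care, match
```

The don't-care state falls out of the same rule: with both MTJs in the low-resistance state, whichever transistor conducts always sees a low resistance.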

In the configuration operation (write operation for the MTJs), a two-step write method is
used, and the memory elements b1 and b2 are written serially. A pre-read step before each
write is necessary to deal with the non-deterministic behavior of the VCMA-driven
precessional switching: the same pulse toggles the MTJ in either direction, so the current
state must be known before the pulse is applied. Typically, a pulse with -1 V amplitude and
1 ns duration should be applied to the MTJ to achieve precessional switching. To generate the
write pulse in the proposed cell, the EN node is discharged to ground level, and M1 or M2
must be turned on to electrically connect the CE node to the EN node. Then, the write
pulse applied to the BL is propagated to the CE node through M3.
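
The pre-read-then-write sequence can be sketched as follows. This is a minimal illustration, assuming an idealized MTJ in which every write pulse toggles the state with certainty; the class `ToggleMTJ` and the helper names are hypothetical.

```python
class ToggleMTJ:
    """Idealized VC-MTJ (assumption): each write pulse flips the stored state."""

    def __init__(self, state: str = "P"):
        self.state = state  # "P" (low resistance) or "AP" (high resistance)
        self.pulses = 0     # number of write pulses applied so far

    def read(self) -> str:
        return self.state

    def pulse(self) -> None:
        self.state = "AP" if self.state == "P" else "P"
        self.pulses += 1


def write_with_preread(mtj: ToggleMTJ, target: str) -> None:
    # Pre-read: a toggling pulse is applied only if the state must change.
    if mtj.read() != target:
        mtj.pulse()


def configure_cell(b1: ToggleMTJ, b2: ToggleMTJ, target_b1: str, target_b2: str) -> None:
    # b1 and b2 share the access transistor M3, so they are written one after the other.
    for mtj, target in ((b1, target_b1), (b2, target_b2)):
        write_with_preread(mtj, target)


b1, b2 = ToggleMTJ("P"), ToggleMTJ("P")
configure_cell(b1, b2, "P", "AP")   # store logic '0' (b1 = R_P, b2 = R_AP)
print(b1.read(), b2.read(), b1.pulses, b2.pulses)
```

Without the pre-read, a pulse applied to an MTJ that already holds the target state would flip it away from that state.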

The shared access transistor M3 reduces the cell area at the expense of configuration
time. The increased configuration time is not a critical issue, since TCAM applications
perform the configuration operation far less frequently than the search operation.
Furthermore, voltage-controlled magnetic anisotropy (VCMA)-driven precessional switching
is fast enough to compensate for the increased number of write operations.

In the search operation, the CE node is initially grounded, as shown in Figure 4(b). When the
cell receives a spike from the pre-cell via the EN node, the potential of the CE node is
determined by the resistive-capacitive (RC) delay originating from the resistance of the MTJs
and the intrinsic capacitance of the CE node. For example, consider the case where the
MeTCAM cell stores a logic value ‘0’ (b1 = R_P, b2 = R_AP). In the case of a search ‘0’
operation (SLB=0, SL=1), the CE node can reach the threshold voltage of the BFR transistor
due to the relatively short RC delay through the low resistance of b1 (R_P), resulting in a
spike on the OUT node. On the other hand, for a search ‘1’ operation (SLB=1, SL=0), the
potential of the CE node is unable to reach the threshold of BFR due to the relatively long RC
delay through the high resistance of b2 (R_AP), causing no spike on the OUT node. If the
MeTCAM cell has a “don’t care” condition (b1 = R_P, b2 = R_P), the cell generates a spike
regardless of the search data due to the short RC delay.
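
The RC argument above can be made concrete with a first-order charging model, V_CE(t) = V_EN (1 - exp(-t/RC)). The sketch below is a rough illustration only: the node capacitance (1 fF), spike amplitude (1 V), inverter threshold (0.5 V), and spike width (60 ps) are assumed values, the transistor on-resistance is ignored, and the MTJ resistances are taken from option (i) of Section 2.2.

```python
import math


def time_to_threshold(r_ohm: float, c_farad: float, v_en: float, v_th: float) -> float:
    """Time for an RC node charged toward v_en to reach v_th."""
    return r_ohm * c_farad * math.log(v_en / (v_en - v_th))


def cell_spikes(r_mtj_ohm: float,
                c_ce_farad: float = 1e-15,     # assumed CE node capacitance
                v_en: float = 1.0,             # assumed spike amplitude
                v_th: float = 0.5,             # assumed BFR threshold
                spike_width_s: float = 60e-12  # assumed EN spike width
                ) -> bool:
    # The cell fires only if CE crosses the threshold before the EN spike ends.
    return time_to_threshold(r_mtj_ohm, c_ce_farad, v_en, v_th) <= spike_width_s


R_P, R_AP = 50e3, 150e3  # option (i) of Section 2.2
print(cell_spikes(R_P))   # conducting path through the low-resistance MTJ
print(cell_spikes(R_AP))  # conducting path through the high-resistance MTJ
```

With these assumed values the threshold is reached in about 35 ps through R_P and about 104 ps through R_AP, so only the low-resistance path produces a spike within the 60 ps window.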

1.4  Spike-based  Architecture  and  Operation 

The second approach to achieving a high-performance, low-power MeTCAM is to adopt a new
architecture. The proposed MeTCAM cell receives a spike on its EN node to activate the
comparison of the stored data with the search data. At the circuit level, instead of
connecting all cells to a common ML, we connect the OUT node of each cell to the EN node of
the next cell. The implementation of this structure is shown in Figure 5.

When search data arrives, a spike is passed to the EN of the first stage. The comparison
result (OUT) of the first stage drives the EN of the second stage. If the search data matches
the stored data, the spike propagates to the second stage and enables its search. On a
mismatch, the mismatched cell does not generate a spike, and none of the subsequent cells
execute the search operation, since no spike arrives at their EN nodes. Figure 5 shows the
concept of the proposed spike-computation-based MeTCAM.
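
The chained search can be sketched behaviorally as a loop that stops at the first mismatch. This is an illustrative model, not the circuit: the function name `spike_search` and the string encoding of a stored word ('0', '1', 'X' for don't care) are assumptions.

```python
def spike_search(stored: str, search: str) -> tuple[bool, int]:
    """Propagate a spike through the cell chain.

    Returns (matched, cells_evaluated). The spike stops at the first
    mismatching cell, so later cells never perform a comparison.
    """
    if len(stored) != len(search):
        raise ValueError("stored word and search word must have equal length")
    evaluated = 0
    for stored_bit, search_bit in zip(stored, search):
        evaluated += 1  # this cell received a spike on EN
        if stored_bit != "X" and stored_bit != search_bit:
            return False, evaluated  # no spike on OUT: chain goes quiet
    return True, evaluated  # spike reached the final stage


print(spike_search("10X1", "1001"))  # full match, all four cells evaluated
print(spike_search("10X1", "0001"))  # mismatch at the first cell
```

The second call evaluates a single cell; the remaining three consume no search energy, which is the saving the architecture relies on.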


[Figure 5 legend: matched cell, mismatched cell, no operation]

Figure  5:  Concept  of  the  Spike-Propagating  Search 

At t0, a spike is passed to the first stage. The spike propagates to the next stage if the
search data is a match. At the mismatch, the spike stops. The pattern is a match if the spike
arrives at the final stage.


By  eliminating  redundant  searches,  we  have  the  advantage  of  serial  search  in  terms  of  search 
energy/power  density.  The  actual  power  is  even  smaller,  since  we  do  not  need  to  drive  a  large 
loading  on  the  shared  ML.  The  stages  are  completely  self-enabled,  thus  each  stage  does  not 
require  a  control  system  to  decide  whether  to  disable  the  next  stage.  This  allows  the  search 
speed  to  be  only  slightly  slower  than  parallel  search. 

Pipelining has been widely adopted in computing to increase the overall throughput of a
processor. However, conventional TCAM architectures cannot be pipelined efficiently
without splitting the array, adding registers and control systems, and degrading the
performance of each stage due to the shared ML. If we attempt to start another search before
the first one is finished, the two search results will collide on the ML, causing both
searches to fail.

On the other hand, the spike-propagation architecture has a distinct OUT/EN for each stage,
meaning that we can apply pipelining to increase throughput. Figure 6 shows how pipelining is
applied to the spike-propagation architecture. As the spike passes to the second cell, the
first cell is no longer occupied, so a new search can start. In practice, the spike moves too
fast to achieve a full pipeline for every cell (a K-stage pipeline for K-bit data):
peripheral circuits need calculation time, and the search data (SL) needs settling time
between each bit of search data. However, with the help of circuit techniques, it is possible
to start a new search once the spike has traveled a sufficient distance. In this work, we
estimate that we can pipeline every 16 cells (i.e. start a new search when the spike has
arrived at the 16th cell), which gives 128/16 = 8 pipeline stages and improves overall
throughput by 8x for a 128-bit ML. This results in 0.125x average search time and a search
operation potentially even faster than completely parallel architectures.

Figure 6: Pipelining applied to the Spike-Propagation Architecture

(a) When the spike is passed to a later stage, the previous stage is free to accept another
spike, and another search operation is started. (b) Processed data sets with respect to time.
A, B, C, and D are different data sets. The pipeline scheme allows the array to fully use the
resources (MeTCAM cells), increasing the throughput by a factor of P, where P is the number
of pipeline stages.
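
The throughput estimate for the pipelined array is simple arithmetic, sketched below. The function name `pipeline_gain` is hypothetical; the inputs (a 128-bit ML and a new search every 16 cells) are the values quoted in the text.

```python
def pipeline_gain(ml_bits: int, cells_per_stage: int) -> tuple[int, float]:
    """Return (throughput gain P, relative average search time 1/P)."""
    stages = ml_bits // cells_per_stage  # searches in flight at the same time
    return stages, 1.0 / stages


gain, relative_time = pipeline_gain(ml_bits=128, cells_per_stage=16)
print(gain, relative_time)  # 8 stages, 0.125x average search time
```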

2. Calibration of VC-MTJ


2.1  Physical  and  Dynamic  Model  of  VC-MTJ 

A compact model that accurately captures the VCMA-induced magnetization dynamics is
essential for the successful development of VCMA-based memory. Although several
single-domain Landau-Lifshitz-Gilbert (LLG)-based macrospin models have been reported
[7]-[10], only a few have included the VCMA effect [11], [12]. It has been experimentally
observed that the VCMA effect impacts magnetization dynamics both in the presence of STT
and on its own, giving rise to an oscillatory behavior of the switching probability as a
function of applied pulse width, which differs from that of pure STT and thermally activated
STT switching. Hence, previous models need to be complemented by incorporating the voltage
dependence of the anisotropy at the interface of the free layer and the tunnel barrier.

In this chapter, we include the VCMA effect as a component of the effective magnetic field
H_eff in an LLG-based macrospin compact model, allowing implementation in a hardware
description language such as Verilog-A. We also calibrate the parameters of the VC-MTJ
compact model against measurement results.

We assume a single-domain MTJ structure where the three-dimensional dynamics of the free
layer’s magnetic moment m = (m_x, m_y, m_z), with |m| = 1, can be described by an LLG
equation in the presence of a voltage-dependent effective field H_eff(V) [13]:

$$\frac{d\mathbf{m}}{dt} = -\gamma'\,\mathbf{m} \times \mathbf{H}_{\mathrm{eff}} - \alpha\gamma'\,\mathbf{m} \times (\mathbf{m} \times \mathbf{H}_{\mathrm{eff}}) \qquad (1)$$

where α is the material-dependent Gilbert damping factor, m is a unit vector in the direction
of magnetization, and γ' is the reduced gyromagnetic ratio. The first term in equation (1)
is responsible for precessional motion, while the second term provides a damping torque that
makes m align with H_eff.

The effective magnetic field H_eff in the free layer of the MTJ is a function of the applied
voltage across the device due to the VCMA effect. A quantitative analysis of the voltage
dependence of each component in H_eff allows extracting the critical voltage V_c, the minimum
voltage that induces the voltage-driven precessional switching. In addition to the voltage
amplitude across the device (V_MTJ > V_c), the pulse duration also plays a significant role in
precessional reorientation. To achieve switching, the pulse needs to be removed when the
magnetic moment achieves 180° reorientation. If the pulse duration is too long or too short,
the magnetic moment may return to its initial state. Figure 7 illustrates the VCMA-induced
precessional switching mechanism in the free layer of the MTJ.

Figure 7: Illustration of the Voltage-Induced Precessional Switching Mechanism in the
Free Layer of a Perpendicularly Magnetized MTJ

(a) Under zero electric bias condition (V_MTJ = 0 V at t < t0), the free layer is aligned with
the out-of-plane direction because the perpendicular magnetic anisotropy H_PMA is a
dominant component in H_eff. (b) When an applied voltage across the device reduces H_eff
via the VCMA effect, the magnetic moment starts to precess around the in-plane direction. (c)
If the width of the applied pulse is designed to coincide with half the precession period, a
full 180° switching can be achieved. Note that voltage with opposite polarity cannot switch
the device because it enhances H_eff.
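
The pulse-width sensitivity described above can be illustrated by integrating equation (1) numerically. The sketch below is not the calibrated compact model. It assumes that the voltage pulse cancels the perpendicular anisotropy completely, so that only a constant in-plane field remains during the pulse, and it uses hypothetical values for the damping factor (0.02) and the in-plane field (18 kA/m). The reduced gyromagnetic ratio is taken as γ' = γ0/(1 + α²), a common convention that is also an assumption here.

```python
import math

GAMMA0 = 2.211e5  # gyromagnetic ratio times mu0, m/(A*s), free-electron value


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def llg_rhs(m, h, alpha):
    """Right-hand side of equation (1); h is the effective field in A/m."""
    gamma_p = GAMMA0 / (1.0 + alpha ** 2)  # reduced gyromagnetic ratio (assumed form)
    mxh = cross(m, h)
    mxmxh = cross(m, mxh)
    return tuple(-gamma_p * (mxh[i] + alpha * mxmxh[i]) for i in range(3))


def half_precession_period(h_inplane, alpha):
    """Time for a 180 degree rotation about the in-plane field."""
    return math.pi * (1.0 + alpha ** 2) / (GAMMA0 * h_inplane)


def apply_pulse(pulse_width, h_inplane=18e3, alpha=0.02, dt=1e-12):
    """Integrate m during the pulse with RK4, starting from +z; returns final m."""
    m = (0.0, 0.0, 1.0)
    h = (h_inplane, 0.0, 0.0)  # assumption: anisotropy fully cancelled by the pulse

    def shifted(v, k, s):
        return tuple(v[i] + s * k[i] for i in range(3))

    for _ in range(int(round(pulse_width / dt))):
        k1 = llg_rhs(m, h, alpha)
        k2 = llg_rhs(shifted(m, k1, dt / 2), h, alpha)
        k3 = llg_rhs(shifted(m, k2, dt / 2), h, alpha)
        k4 = llg_rhs(shifted(m, k3, dt), h, alpha)
        m = tuple(m[i] + dt / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i])
                  for i in range(3))
        norm = math.sqrt(sum(c * c for c in m))
        m = tuple(c / norm for c in m)  # keep |m| = 1
    return m


t_half = half_precession_period(18e3, 0.02)
print(apply_pulse(t_half)[2])      # m_z after a half-period pulse (switched)
print(apply_pulse(2 * t_half)[2])  # m_z after a full-period pulse (back near +z)
```

A pulse of half the precession period leaves m_z close to -1, while a pulse twice as long brings the moment back toward its initial state, which is the behavior that makes the pulse duration critical.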


2.2 Calibration of Resistance-Area Product and TMR Ratio


Statistical MTJ resistance distributions and the resistance-area (RA) product have been
extracted as a function of the device diameter, as shown in Figure 8. Typically,
voltage-controlled MTJ devices have a relatively thick tunneling barrier (>1.5 nm), resulting
in a high RA product. In our measurement, the RA product is 650 Ω·µm². Also, to obtain high
thermal stability, the diameter of the MTJs in the MeTCAM application ranges from 80 nm to
120 nm, which in turn gives rise to an MTJ resistance (R_P) ranging from 50 kΩ to 130 kΩ.
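
The quoted resistance range follows from dividing the RA product by the junction area. The sketch below assumes a circular junction, which the text allows for perpendicular MTJs; the function name `mtj_resistance_ohm` is hypothetical.

```python
import math


def mtj_resistance_ohm(ra_ohm_um2: float, diameter_nm: float) -> float:
    """Parallel-state resistance of a circular MTJ from its RA product."""
    radius_um = diameter_nm * 1e-3 / 2.0
    area_um2 = math.pi * radius_um ** 2
    return ra_ohm_um2 / area_um2


for d_nm in (80, 100, 120):
    r_kohm = mtj_resistance_ohm(650.0, d_nm) / 1e3
    print(f"{d_nm} nm -> {r_kohm:.1f} kOhm")
```

For diameters of 80 nm and 120 nm this gives roughly 129 kΩ and 57 kΩ, in line with the 50 kΩ to 130 kΩ range stated above.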


[Figure 8 axis labels: resistance (kΩ), junction diameter (nm); panels (a) and (b)]

Figure 8: (a) MTJ Resistance Distributions and (b) Resistance-Area (RA) Product
as a Function of the Device Diameter

Due  to  the  relatively  thick  tunnel  barrier  (MgO),  VC-MTJs  have  a  higher  resistance 
compared  to  a  conventional  current-driven  MTJ. 

The TMR ratio is defined by the following equation:

$$\mathrm{TMR} = \frac{R_{AP} - R_P}{R_P} \times 100 \qquad (2)$$

Although the measured TMR is 50%, it can be enhanced through device optimization. Hence,
based on the measurement data, we set the low and high resistance states of the VC-MTJ
compact model to either (i) R_P = 50 kΩ and R_AP = 150 kΩ (TMR = 200%) or (ii) R_P = 100 kΩ
and R_AP = 200 kΩ (TMR = 100%).
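
Equation (2) can be checked against the two model settings with a few lines of code. The helper names below are hypothetical.

```python
def tmr_percent(r_p: float, r_ap: float) -> float:
    """TMR ratio in percent, as defined in equation (2)."""
    return (r_ap - r_p) / r_p * 100.0


def r_ap_from_tmr(r_p: float, tmr: float) -> float:
    """High-state resistance implied by a given R_P and TMR (percent)."""
    return r_p * (1.0 + tmr / 100.0)


print(tmr_percent(50e3, 150e3))    # setting (i)
print(tmr_percent(100e3, 200e3))   # setting (ii)
print(r_ap_from_tmr(100e3, 50.0))  # R_AP at the measured 50% TMR
```

At the measured TMR of 50%, R_AP is only 1.5 times R_P, so the two model settings assume a larger resistance contrast than the measured devices provide.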

2.3  Calibration  of  Voltage-Controlled  Precessional  Switching  Speed 

The switching time of the VC-MTJ is equal to half the precession period of the free layer’s
magnetization. The precession speed is mainly determined by the in-plane component of the
effective magnetic field H_eff, which originates from either a stray field from another
magnetic layer of the device or an externally applied field. Figure 9(a) shows the measured
switching probability of the VC-MTJ as a function of the applied pulse duration; the highest
switching probability is observed for pulse durations between 0.8 ns and 1 ns. In the VC-MTJ
compact model, the equivalent switching speed can be achieved by adjusting the in-plane
magnetic field between 16 kA/m and 20 kA/m. This sub-nanosecond switching speed is an
advantage in the write (configuration) operation of the MeTCAM application.
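
As a rough cross-check of the calibrated field range, the half precession period can be estimated from the Larmor frequency alone. This back-of-the-envelope sketch assumes the free-electron gyromagnetic ratio and neglects damping and any residual anisotropy, so it is not a substitute for the compact model.

```python
import math

MU0 = 4.0 * math.pi * 1e-7  # vacuum permeability, T*m/A
GAMMA = 1.760859e11         # free-electron gyromagnetic ratio, rad/(s*T) (assumed)


def half_period_s(h_inplane_a_per_m: float) -> float:
    """Half of the Larmor precession period about an in-plane field."""
    omega = GAMMA * MU0 * h_inplane_a_per_m  # precession angular frequency, rad/s
    return math.pi / omega


for h in (16e3, 18e3, 20e3):
    print(f"{h / 1e3:.0f} kA/m -> {half_period_s(h) * 1e9:.2f} ns")
```

The estimate gives roughly 0.7 ns to 0.9 ns over the 16 kA/m to 20 kA/m range, the same order as the measured optimum pulse duration, and it shrinks as the in-plane field grows.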


Figure 9: (a) Measured Probability of Precessional Voltage-Induced Switching in a
Perpendicular MTJ, for both P to AP and AP to P Directions and (b) Transient
Simulations for the In-Plane External Magnetic Field Hext Dependence of the Switching Speed

(a) Note the oscillatory dependence of the write probability on the voltage pulse duration,
which is a signature of the precessional write process. (b) As Hext increases, the
switching speed increases.

3. Spike-based TCAM Architecture using SRAM


3.1  Structural  Optimization 

The operation sequence of a TCAM is: enable search → match/mismatch comparison → detect
output of the search results → return to initial state. These operations lead to three
design aspects in the structure of an SRAM-based spiking TCAM cell:

1. Design of the comparison path: In the conventional TCAM cell, the comparison path is
connected between the ML and ground, and becomes conducting when the cell
mismatches, leading to discharge of the ML. On the other hand, the spiking architecture
accepts spikes as input and generates a post-cell spike when a match occurs. When the
cell mismatches, no post-cell spike is generated. Thus, the comparison path should be
conducting when a match occurs. Furthermore, for the smallest cell area configuration,
the output polarity is inverted, leading to the need for a reversed comparison path.

2.  Pre-charge  and  enable:  In  the  conventional  TCAM,  the  ML  is  pre-charged  to  high  before 
a  search  operation.  The  search  operation  is  enabled  by  the  SL.  For  the  spiking  TCAM,  the 
comparison  operation  is  enabled  when  a  spike  is  received  (Men).  At  the  end  of  the  spike, 
the  cell  should  return  to  its  original  state  (Mpre). 

3.  Design  of  the  search  path:  The  spike-propagating  TCAM  (SPTCAM)  takes  advantage  of 
removing  redundant  searches  to  minimize  energy,  at  the  cost  of  a  slightly  lower  search 
speed.  To  alleviate  this  issue,  we  need  to  minimize  the  delay  of  the  search  path.  This  can 
be  achieved  by  minimizing  capacitive  charging  during  the  search  operation.  In  other 
words,  the  search  path  should  pass  through  a  minimal  amount  of  capacitive  loading,  and 
nodes  that  can  be  determined  before  the  spike  is  received  should  be  computed  early. 

We investigate two cell designs that fulfill the above constraints, as shown in Figure 10. In
Figure 10(a), the spike passes through the cell extremely fast because the delay along the
propagation path is minimized. On a match, nodes on the comparison path (L, R, and Nx) are
already at their final values before the spike arrives. However, this structure potentially
suffers from data-dependent noise when the search data is a mismatch. When the comparison
path is cut off, Nx is floating before the spike arrives, so the search operation is affected
by the previous voltage on Nx. When the previous search is a match and the current search is
a mismatch, Nx floats at 0, and the output drops slightly even though the comparison path is
nonconducting.

The structure in Figure 10(b) trades a portion of speed and energy for higher reliability.
While nodes L and R are evaluated after the spike arrives, nodes on the search path are
pre-charged, so the output is not affected by the previous search operation. This avoids the
issue of data-dependent noise. We currently use this configuration for the evaluation of the
scheme, but do not rule out the design in Figure 10(a), as it is ~1.8x better in terms of
speed and energy.



Approved  for  public  release;  distribution  is  unlimited. 


Figure  10:  (a)  High-Speed,  Low-Energy  Cell  Structure  and  (b)  Reliable  Cell  Structure 

(a)  The  spike  propagation  path  is  minimized  and  internal  nodes  L  and  R  are  determined 
before  the  spike  arrives.  However,  this  structure  suffers  from  data-dependent  noise,  (b) 
There  will  not  be  any  data-dependent  floating  nodes  along  the  search  path.  However,  this 
comes  at  the  cost  of  lower  speed  and  increased  energy  consumption. 

3.2  Performance  Analysis 

The key performance of a TCAM can be evaluated in terms of its speed (delay^-1) and its
energy efficiency ((energy/search)^-1). A common figure of merit (FOM) is the inverse of the
energy-delay product (EDP^-1, i.e. energy^-1 · delay^-1). We constructed 32-bit, 64-bit, and
128-bit TCAMs and analyzed these metrics for the conventional and the spiking structures. The
delay is measured from the search enable to the detection of the output. To find the average
energy/search, we first simulate the delay and energy of different data patterns: in the
conventional case, these range from all match, through 1-bit mismatch, to all miss; in the
spiking architecture, they include all miss, a miss at the first bit through a miss at the
last bit, and all match. The ML energy is calculated as the total energy consumed on the
shared ML for the conventional case and on each ML for the spike-propagation case. The SL
energy is the energy consumed on the SL to input the search data for each search operation.
We then compute the average energy consumption assuming a uniform distribution of the data
(i.e. all patterns are equally probable).
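
The averaging step and the FOM definition can be written out as follows. This is a sketch of the bookkeeping only: the helper names are hypothetical, and the per-pattern energies and the delay in the usage lines are placeholder numbers in arbitrary units, not simulation results.

```python
def average_energy(pattern_energies: list[float]) -> float:
    """Mean energy per search when every data pattern is equally probable."""
    return sum(pattern_energies) / len(pattern_energies)


def figure_of_merit(energy_per_search: float, delay: float) -> float:
    """Inverse energy-delay product: larger is better."""
    return 1.0 / (energy_per_search * delay)


# Placeholder per-pattern energies (arbitrary units), for illustration only.
energies = [1.0, 2.0, 3.0]
e_avg = average_energy(energies)
print(e_avg, figure_of_merit(e_avg, delay=0.5))
```

Comparing two architectures then reduces to the ratio of their FOM values, which combines the energy saving of one with the delay penalty it pays.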

Figure 11 shows the metrics of the spiking structure compared to the conventional structure
for different bit lengths. The spiking structure suffers from a ~8x longer delay than the
conventional structure. However, it saves 27x to 87x energy across different ML lengths. The
energy saving is particularly pronounced when the ML is long, in which case most searches
are redundant and are eliminated by the spike-propagating structure. Thus, at an ML length of
128 bits, the overall FOM of the spike-propagating architecture is 7.5x larger than that of
the conventional TCAM.

Figure 11: SPTCAM Performance Comparison with Conventional TCAM across
different ML Lengths

(a) Speed, (b) ML energy efficiency, (c) SL energy efficiency, and (d) FOM (energy-delay
product^-1)


3.2.1  Process-Voltage-Temperature  (PVT)  Variation 


We first identify the effect of global process variation on the performance metrics through
simulation of each process corner. The variations are represented by two-letter designators,
where the first letter refers to the n-type metal-oxide-semiconductor (NMOS) variation, and
the second letter refers to the PMOS variation. In this naming convention, three variations
exist: typical (T), fast (F), and slow (S), giving rise to five important corners: TT, FF,
SS, SF, and FS. As shown in Figure 12, the FOM of the SPTCAM is especially good at the SS
corner, which enhances its low-power advantage. This suggests that low-power processes
benefit the SPTCAM. Conventional TCAM requires high speed, so optimum performance is achieved
at the FF corner.

Figure 12: SPTCAM Performance Comparison with Conventional TCAM across
different Process Corners

(a) Speed, (b) ML energy efficiency, and (c) FOM (energy-delay product^-1)
We then inspect the effect of the supply voltage by sweeping it from 0.6 V to
1.2 V in 0.2 V steps, as shown in Figure 13. The nominal VDD of the process is 1 V. As the
supply voltage increases, the speed of both structures increases, and the energy consumption
also rises. The search delay of the conventional structure is ~10x shorter than that of the
spike-propagating structure for a 128-bit ML, because of the long propagation path in the
latter, as shown in Figure 13(a). The average search energy on the ML of the SPTCAM is 90x
lower than that of the conventional TCAM, as shown in Figure 13(b). As a result, the EDP of
the SPTCAM is on average 8.5x better than that of the conventional TCAM. The SPTCAM achieves
the best EDP at VDD = 0.6 V, while the conventional TCAM works best at the nominal VDD.
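
As a rough consistency check on these ratios, the sketch below combines the quoted energy and delay ratios into an EDP ratio. The function name `edp_improvement` is introduced here for illustration only. The point estimate it produces is not expected to reproduce the reported 8.5x exactly, presumably because the reported figure is averaged over the voltage sweep rather than formed from the two rounded ratios.

```python
def edp_improvement(energy_reduction: float, delay_penalty: float) -> float:
    """Return the factor by which the energy-delay product (EDP) improves.

    energy_reduction: factor by which search energy drops (90 for "90x lower").
    delay_penalty:    factor by which search delay grows (10 for "10x slower").
    """
    if energy_reduction <= 0 or delay_penalty <= 0:
        raise ValueError("ratios must be positive")
    # EDP = energy * delay, so the improvement is the energy gain
    # divided by the delay penalty.
    return energy_reduction / delay_penalty


# Approximate ratios quoted in the text for a 128-bit ML.
ratio = edp_improvement(energy_reduction=90.0, delay_penalty=10.0)
print(f"EDP improvement ~ {ratio:.1f}x")
```

With the rounded ratios from the text this gives roughly 9x, in line with the reported average of 8.5x.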
Figure 13: SPTCAM Performance Comparison with Conventional TCAM across different Supply Voltages

(a) Speed, (b) ML energy efficiency, and (c) FOM (energy-delay product⁻¹)


Lastly, we investigate the effect of temperature variation on the performance of each
structure. We simulate the standard benchmark temperatures, namely -40°C, 25°C,
and 125°C, as shown in Figure 14. At higher temperatures, performance degrades as both the
delay and the energy consumption increase. The FOM of the spike-propagating structure
suffers more at high temperatures, as the leakage energy begins to dominate the search energy.


Figure 14: SPTCAM Performance Comparison with Conventional TCAM across different Temperatures

(a) Speed, (b) ML energy efficiency, and (c) FOM (energy-delay product⁻¹)

3.2.2 Local Mismatch


The Monte Carlo (MC) method is a means of statistically evaluating mathematical functions
using random samples. Since it is impossible to completely avoid mismatch between
components, the MC mismatch analysis provides insight into the effect of these variations.
In each trial, random variables generate mismatch in the device parameters (e.g., channel
length and width) based on measured results. 1000 trials are executed to find the 3-sigma
performance variation in both architectures. Figure 15 presents the results of the MC
mismatch simulation. For the conventional 128-bit TCAM, the mean and standard deviation of
the delay are 329 ps and 12 ps, respectively. The maximum throughput of the conventional
TCAM is limited by the longest delay (361 ps), as shown in Figure 15(a). For the
spike-propagating structure, the maximum throughput is determined by the minimum spike
width (Figure 15(b)) and delay (Figure 15(c)), both of which must cover the variation of the
transistors. A spike width smaller than the variation may result in the spike being
unidentifiable at the output. A delay smaller than the variation may cause spikes to collide.
For reliable operation, the minimum cycle time is set by the sum of the two distributions,
which gives a minimum cycle of 570 ps for the spike-propagating architecture.
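
To illustrate how such a mismatch run is summarized, the following minimal sketch draws 1000 delay samples and reports their mean, standard deviation, 3-sigma bound, and longest observed delay. It assumes a Gaussian delay distribution with the mean and standard deviation quoted above for the conventional TCAM. The actual analysis perturbs device parameters inside a circuit simulator, so the function names, the seed, and the Gaussian assumption are illustrative only.

```python
import random
import statistics


def sample_delays(mean_ps, sigma_ps, trials, seed=0):
    """Draw `trials` delay samples (ps) from an assumed Gaussian distribution."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.gauss(mean_ps, sigma_ps) for _ in range(trials)]


def summarize(delays):
    """Return the statistics typically read off a Monte Carlo histogram."""
    mean = statistics.fmean(delays)
    sigma = statistics.stdev(delays)  # sample standard deviation
    return {
        "mean_ps": mean,
        "sigma_ps": sigma,
        "three_sigma_bound_ps": mean + 3.0 * sigma,
        "longest_ps": max(delays),
    }


# Mean and sigma quoted in the text for the conventional 128-bit TCAM.
delays = sample_delays(mean_ps=329.0, sigma_ps=12.0, trials=1000)
stats = summarize(delays)
nominal_bound_ps = 329.0 + 3.0 * 12.0  # 3-sigma bound from the quoted values

for name, value in stats.items():
    print(f"{name}: {value:.1f}")
print(f"nominal 3-sigma bound: {nominal_bound_ps:.1f} ps")
```

The nominal 3-sigma bound from the quoted values is 365 ps, close to the 361 ps longest delay reported for the conventional TCAM.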


Figure 15: Local Mismatch Simulation for a 128-bit Conventional and the Spike-Propagating Structure

(a) Conventional TCAM delay distribution, (b) spike-propagating TCAM output spike width
distribution, and (c) spike-propagating TCAM delay distribution.

For a non-pipelined architecture, the delay itself determines the throughput of the device.
However, when the pipeline scheme is applied, the minimum cycle determines the
throughput. With the pipeline scheme, our new FOM, defined as (throughput * energy
efficiency), improves further to 50x that of the conventional scheme.
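
The way the pipelined FOM combines cycle time and energy can be sketched as follows. This is a back-of-the-envelope estimate, not the report's calculation: it assumes the 90x ML energy ratio from Section 3.2.1 can stand in for the overall energy-efficiency ratio, and it takes the cycle times from the mismatch analysis (361 ps for the conventional TCAM, 570 ps for the spike-propagating structure). The function name is hypothetical.

```python
def pipelined_fom_gain(energy_gain, conv_cycle_ps, sptcam_cycle_ps):
    """Ratio of SPTCAM FOM to conventional FOM, with FOM = throughput * energy efficiency.

    energy_gain:      assumed energy-efficiency ratio (SPTCAM over conventional).
    conv_cycle_ps:    minimum cycle time of the conventional TCAM in ps.
    sptcam_cycle_ps:  minimum cycle time of the pipelined SPTCAM in ps.
    """
    if min(energy_gain, conv_cycle_ps, sptcam_cycle_ps) <= 0:
        raise ValueError("all arguments must be positive")
    # Throughput is 1 / cycle time, so the throughput ratio is the
    # inverse ratio of the cycle times.
    throughput_ratio = conv_cycle_ps / sptcam_cycle_ps
    return energy_gain * throughput_ratio


gain = pipelined_fom_gain(energy_gain=90.0, conv_cycle_ps=361.0, sptcam_cycle_ps=570.0)
print(f"Estimated FOM gain ~ {gain:.0f}x")
```

Under these assumptions the estimate is about 57x, the same order as the reported ~50x. The gap is expected, since the reported figure is based on the simulated energy efficiency rather than the rounded ML ratio.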
4. MeTCAM Cell Design


4.1  Structural  Analysis 

For  precise  search  operation,  spikes  received  at  each  cell  should  be  shaped  such  that  the  cell 
can  distinguish  match  and  mismatch.  In  other  words,  the  potential  at  CE  (See  Figure  4(a)) 
should  reach  the  threshold  if  the  cell  is  a  match  and  remain  below  the  threshold  otherwise. 
However,  as  the  spike  propagates  along  the  search  path,  its  shape  will  inevitably  change  due 
to  the  asymmetric  slope,  delay,  and  voltage  range  between  the  rising  and  falling  edges.  This 
change  will  accumulate  along  the  search  path  and  eventually  cause  the  search  operation  to 
fail.  As  such,  we  need  to  create  different  configurations  of  the  MeTCAM  cell  and  utilize  a 
combination  of  different  cell  configurations  to  calibrate  the  spike  shape  along  the  spike 
propagation  path. 

To design the path of spike propagation, we analyze the delay, the acceptable input spike shape
for correct search operation, and the post-cell spike duration for a variety of MeTCAM cell
configurations. Two design choices that do not increase the MeTCAM cell area are
discussed: the selection of transistor type (PMOS or NMOS) for M1/M2, and the placement
of M1/M2 with respect to the VC-MTJs (M1/M2 connected to CE, or VC-MTJs connected to
CE). These analyses are carried out for both input spike polarities (positive and
negative spikes). A summary of the analytical and quantitative results is shown in Table 2.
The transistor type mainly affects the rising/falling slope and the CE voltage range. The
location of the transistors further manipulates the slope of one edge (but not the other)
through source degeneration, where the voltage drop across the VC-MTJ during the search
operation reduces the driving capability of M1 and M2.

Table 2. Delay, Post-Cell Spike Characteristics (tSPIKE_O), and Acceptable
Input Spike (tSPIKE_I) of different Cell Configurations when Stimulated by Positive and
Negative Spikes

The delay and tSPIKE_O values shown are simulated with the input spike duration at the center
of the tSPIKE_I range. They vary slightly for different tSPIKE_I values and different
input spike shapes.

Cell     Rising (0->1)             Falling (1->0)            Response to pos. pulse            Response to neg. pulse
Config.                                                      Delay   tSPIKE_O*  tSPIKE_I**     Delay   tSPIKE_O*  tSPIKE_I**
N        slow, with degeneration   fast                      127ps   -86ps      130-230ps      46ps    +56ps      30-60ps
ZN       slow                      fast, with degeneration   78ps    -25ps      70-130ps       65ps    +6ps       50-110ps
P        fast                      slow, with degeneration   44ps    +133ps     30-50ps        170ps   -131ps     170-290ps
ZP       fast, with degeneration   slow                      62ps    +34ps      50-100ps       106ps   -38ps      100-150ps

*Change in the post-cell spike duration with respect to the input spike duration

**Range of the input spike duration for which a cell can distinguish between match and mismatch
The delay of the first edge of the input spike determines the search delay of the cell. On the
other hand, the difference between the rising and falling edge speeds translates to a change
in the spike shape. When the first edge has a longer delay than the second edge, the spike
duration is shortened. Thus, the N/ZN (P/ZP) configurations reduce the spike duration for
positive (negative) spikes. If the first edge of the spike experiences source degeneration,
the post-cell spike is further shortened. As a result, the N/P (ZN/ZP) configurations shorten
the spike duration more (less) significantly. Through combinations of these different
configurations, the spike shape along the propagation path can be calibrated.
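
The relationship between edge delays and spike duration can be sketched with an idealized first-order model in which each edge is simply delayed by a fixed amount. The function name is hypothetical, and the second-edge delays used in the example (41 ps and 102 ps) are not simulated values: they are back-calculated so that the duration changes match the -86 ps and +56 ps listed in Table 2 for the first configuration, whose first-edge delays are 127 ps and 46 ps.

```python
def post_cell_duration(t_in_ps, first_edge_delay_ps, second_edge_delay_ps):
    """Idealized post-cell spike duration in ps.

    The leading edge leaves the cell `first_edge_delay_ps` after it arrives and
    the trailing edge leaves `second_edge_delay_ps` after it arrives, so the
    duration changes by (second_edge_delay_ps - first_edge_delay_ps).
    """
    t_out = t_in_ps + second_edge_delay_ps - first_edge_delay_ps
    # If the trailing edge would overtake the leading edge, the spike vanishes.
    return max(t_out, 0.0)


# Slow first edge (127 ps) and fast second edge (assumed 41 ps): spike shrinks.
print(post_cell_duration(180.0, 127.0, 41.0))
# Fast first edge (46 ps) and slow second edge (assumed 102 ps): spike widens.
print(post_cell_duration(40.0, 46.0, 102.0))
# An input spike that is too short disappears entirely in this model.
print(post_cell_duration(50.0, 127.0, 41.0))
```

In this model a slow first edge shortens the spike and a slow second edge lengthens it, which is the behavior the different cell configurations are combined to balance.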

The slope of signal development, which is mainly determined by the location of the
transistor, translates to the acceptable input pulse duration. Source degeneration
slows signal development in one direction, changing the required input pulse duration.
The N (P) configuration thus requires a significantly longer input spike, and accepts a wider
range of input spike durations, for a positive (negative) pulse than its ZN (ZP) counterpart.

4.2  Layout  Extraction 

We constructed the layout of each cell configuration to extract its parasitic capacitances for a
more accurate simulation of its behavior in the physical chip. The area of each cell in 65nm
technology is roughly 3 µm², which is ~3x smaller than that of the
conventional TCAM cell in the same technology. An illustration of the layout of a single cell
is shown in Figure 16, with each node labeled according to Figure 4(a). The extracted
parasitic capacitances show that the ZN/ZP configuration has a slightly larger capacitance on
the CE node, as the MeRAMs contribute to its loading. On the other hand, the N/P
configuration has a slightly larger capacitance on the EN node. SLB has a larger parasitic
capacitance than SL since SLB is closer to the internal parts of the circuit, leading to an
increased capacitance on its metal lines. The extracted parasitics are summarized
in Table 3. Post-layout simulation results for each cell are shown in Table 4.


Figure  16:  Illustration  of  the  Layout  of  a  MeTCAM  Cell 

The cell occupies 3 µm², while a conventional TCAM cell occupies around 9 µm².
Table 3. Post-Layout Parasitics on each Node along the Search Path

Node    P       N       ZP      ZN
CE      0.54    0.54    0.51    0.51
EN      0.19    0.19    0.28    0.28
OUT     0.22    0.22    0.22    0.22
SL      0.2     0.2     0.2     0.2
SLB     0.34    0.34    0.34    0.34

Unit: fF


Table 4. Post-Layout Characteristics of each Cell Configuration

Cell     Response to pos. pulse            Response to neg. pulse
Config.  Delay   tSPIKE_O*  tSPIKE_I**     Delay   tSPIKE_O*  tSPIKE_I**
N        168ps   -117ps     180-320ps      61ps    +53ps      30-80ps
ZN       99ps    -26ps      90-160ps       81ps    +7ps       70-130ps
P        60ps    +83ps      30-70ps        228ps   -170ps     240-390ps
ZP       79ps    +37ps      70-120ps       135ps   -72ps      130-200ps

* and ** as defined in Table 2.
5. MeTCAM Array Design


5.1  Search  Path  Construction 

As mentioned earlier, the criterion for correct MeTCAM operation is that the post-cell spike
duration of the Nth cell must be within the input pulse-width range of the (N+1)th cell. The
post-layout simulations from Section 4.2 serve as design constraints and guidelines for
constructing the search path. A variety of cell configurations are used to calibrate the spike
shape as it propagates. A design example of a 16-bit spike-propagation path is shown in
Figure 17.


Figure  17:  Example  of  a  16-bit  Spike-Propagation  Path 

The notation is (cell configuration)(input polarity); e.g., ZP+ represents a ZP configuration
with a positive input pulse. The simulated spike duration after each cell is shown on the
connection between adjacent MeTCAM cells.
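
The path-construction criterion can be expressed as a simple feasibility check. The sketch below is a first-order approximation under stated assumptions: it treats each cell's change in spike duration as a constant taken from a subset of the Table 4 values, whereas the report notes that these values shift with input spike duration and shape. It also takes the polarity at each stage as given by the caller and does not model polarity inversion or buffers between cells. The names `CELLS` and `check_path` and the example paths are hypothetical and do not reproduce the path in Figure 17.

```python
# (configuration, input polarity) ->
#   (change in spike duration [ps], min acceptable input [ps], max acceptable input [ps])
# Subset of the post-layout values in Table 4.
CELLS = {
    ("P", "+"): (83, 30, 70),
    ("ZP", "+"): (37, 70, 120),
    ("ZN", "+"): (-26, 90, 160),
    ("N", "-"): (53, 30, 80),
    ("ZP", "-"): (-72, 130, 200),
    ("P", "-"): (-170, 240, 390),
}


def check_path(path, t_in_ps):
    """Propagate a spike of duration `t_in_ps` through `path`.

    Returns (ok, durations, failing_index), where `durations` lists the spike
    duration after each cell that accepted its input and `failing_index` is the
    position of the first cell whose input falls outside its acceptable range
    (None if every cell accepts its input).
    """
    durations = []
    t = t_in_ps
    for index, stage in enumerate(path):
        delta, t_min, t_max = CELLS[stage]
        if not t_min <= t <= t_max:
            return False, durations, index
        t += delta
        durations.append(t)
    return True, durations, None


good_path = [("P", "+"), ("ZN", "+"), ("ZN", "+"), ("ZP", "+")]
print(check_path(good_path, 50))

bad_path = [("P", "+"), ("P", "+")]
print(check_path(bad_path, 50))
```

In the first path, cells that lengthen the spike alternate with cells that shorten it, so the duration stays within range at every stage. In the second path, two lengthening cells in a row push the spike past the acceptable input range of the second cell.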

We then construct a 64-bit MeTCAM array and verify its operation by simulating
all-match and mismatch patterns at several points along the search path. The 64-bit MeTCAM
array is based on the 16-bit arrays, with modifications along the search path where the
spike duration falls out of range. Figure 18 shows the simulated waveform of each post-cell
spike along the search path for (a) all match, (b) a mismatch at the 15th bit, (c) a mismatch at
the 31st bit, (d) a mismatch at the 47th bit, and (e) a mismatch at the 63rd bit.
Figure 18: Simulated Waveforms along the Search Path for various Search Data Patterns
5.2 Pipeline and Optimization

We show the applicability of the pipeline scheme by conducting search operations with a
cycle time shorter than the total search-path delay. For the 64-bit ML, the delay is ~6.4 ns.
With the pipeline scheme, search operations can be conducted with a cycle of 1 ns. Note that,
at high search frequencies, the characteristics of each cell configuration shown in Table 4 will
change. For example, the P cell becomes easier to stimulate.
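
The benefit of pipelining can be quantified with a short calculation. The sketch below uses only the two figures quoted above (6.4 ns search-path delay and 1 ns cycle). The function name and the derived quantities, such as the number of spikes in flight, are illustrative and are not results reported by the authors.

```python
import math


def pipeline_summary(latency_ns, cycle_ns):
    """Compare non-pipelined and pipelined search throughput.

    latency_ns: total search-path delay of one search.
    cycle_ns:   interval between consecutive searches in the pipelined scheme.
    """
    if latency_ns <= 0 or cycle_ns <= 0:
        raise ValueError("latency and cycle must be positive")
    return {
        # Without pipelining, a new search starts only after the previous one ends.
        "non_pipelined_msearch_per_s": 1e3 / latency_ns,
        # With pipelining, a new spike is launched every cycle.
        "pipelined_msearch_per_s": 1e3 / cycle_ns,
        "throughput_gain": latency_ns / cycle_ns,
        # Upper bound on spikes travelling along the search path at once.
        "max_spikes_in_flight": math.ceil(latency_ns / cycle_ns),
    }


summary = pipeline_summary(latency_ns=6.4, cycle_ns=1.0)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For the 64-bit ML this corresponds to a throughput gain of about 6.4x, with up to seven spikes travelling along the search path at once.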


Figure 19: Applicability of Pipeline Scheme for a 64-bit MeTCAM, where the Search
Cycle Time can be as fast as 1 ns despite the Intrinsic Delay being 6.4 ns

Furthermore, we can use the cell characterization results to optimize the search path depending
on the target specifications. For maximum area efficiency, the spikes should alternate
between positive and negative polarity, so that the buffer size can be reduced. For optimal
latency, the total delay on the search path should be minimized; for maximum throughput, the
spike duration of each stage should be as short as possible. It is also beneficial to have a
relatively long spike at the last stage for better detection.
6. Technical Accomplishments


Task  1:  Design  and  simulation  environment  setup. 

(a) Set up the design platform on Virtuoso with the commercial TSMC 65nm process
design kit (PDK) for mixed-signal circuit simulation. Obtained the North Carolina
State University 40nm and 15nm Fin field-effect transistor (FinFET) PDK for
evaluation in highly advanced technologies.

(b)  Completed  the  Verilog-A  based  compact  VC-MTJ  model  compatible  with  the  design 
environment.  The  model  has  been  calibrated  with  the  most  recent  experimental  results 
on  VC-MTJ  devices. 

Task 2: Design of MeTCAM cell and architecture (schematic level) and verification of
the circuits based on the Spectre circuit simulation.

(a)  Constructed  and  optimized  the  SRAM-based  spike-propagating  architecture  and 
verified  its  performance  across  global  PVT  variations  and  local  mismatch  variations. 
Results  showed  a  ~50x  improvement  in  energy  efficiency-throughput  product. 

(b)  Designed  various  MeTCAM  configurations  and  extracted  characteristics  of  each  cell 
variety.  Constructed  the  MeTCAM  cell  layout  and  performed  post-layout  simulation 
for  accurate  physical-chip  level  evaluation. 

(c) Identified and investigated the design challenges of a MeTCAM array. Then designed
a 64-bit MeTCAM array and verified its functionality, as well as the applicability of the
pipeline scheme.

Task  3:  Physical  chip  layout  and  extraction  of  GDSII  file. 

(a) Drew the layout of various MeTCAM cell configurations and performed verifications
including DRC, LVS, and PEX. The post-layout extracted results have been
incorporated into the design process.

Task 4: VC-MTJ device fabrication via back-end-of-line (BEOL) process.

(a) Fabricated high-VCMA VC-MTJs, with junction dimensions ranging from 50 nm to
100 nm.

(b) Extracted statistical distributions of the switching speed and RA product of our
fabricated VC-MTJ devices. These values have been incorporated into our compact
model and used in circuit simulation.
List of Abbreviations, Acronyms, and Symbols

Acronym     Description
AP          Anti-Parallel
BEOL        Back-End-Of-Line
BL          Bit Line
CAM         Content Addressable Memory
CE          Center
CMOS        Complementary Metal-Oxide-Semiconductor
DARPA       Defense Advanced Research Projects Agency
EDP         Energy-Delay Product
EN          Enable
FinFET      Fin Field-Effect Transistor
FOM         Figure of Merit
IP          Internet Protocol
LLG         Landau-Lifshitz-Gilbert
MC          Monte Carlo
MeRAM       Magneto-Electric Random Access Memory
MeTCAM      Magneto-Electric Ternary Content Addressable Memory
ML          Match Line
MRAM        Magneto-Resistive Random Access Memory
MTJ         Magnetic Tunnel Junction
NMOS        N-type Metal-Oxide-Semiconductor
P           Parallel
PCRAM       Phase-Change Random Access Memory
PDK         Process Design Kit
PMOS        P-type Metal-Oxide-Semiconductor
PVT         Process-Voltage-Temperature
RA          Resistance-Area
RAM         Random Access Memory
RC          Resistive-Capacitive
ReRAM       Resistive Random Access Memory
SL          Search Line
SPTCAM      Spike-Propagating Ternary Content Addressable Memory
SRAM        Static Random Access Memory
STT-RAM     Spin-Transfer Torque Random Access Memory
TCAM        Ternary Content Addressable Memory
TMR         Tunneling Magneto-Resistance
VCMA        Voltage-Controlled Magnetic Anisotropy
VC-MTJ      Voltage-Controlled Magnetic Tunnel Junction